The research reported on here was done at the University of
California at Santa Cruz with partial support from Project Genie
at the University of California at Berkeley (Contract SD-185 with
the Advanced Research Projects Agency of the Department of Defense).


A method of coding a large file for information retrieval is
discussed. Random superimposed coding of machine derived "roots" of
the full vocabulary is used to generate an easily updatable and very
compact code file. No thesaurus or dictionary of terms is needed.
High speed is made possible by the simplicity of the searching algorithm
as well as the ability to search for several key words simultaneously.
The simplicity of the search facilitates implementation of the system
on a small computer with access to a large bulk storage device.


The cost of storing information in machine-accessible form has
declined markedly in the last decade, and promises are such that one
can look forward to having complete libraries available in such form.
This places increased importance on algorithms which make it possible
to search large files efficiently. This paper describes an approach
to this problem.

In practice, information in a large file can be more efficiently
accessed if it is indexed in some manner. The method of indexing which
will be discussed is particularly well suited for a file which:

1) is very dynamic, with both deletions and additions frequently
occurring.

2) contains an extensive vocabulary which is to be encoded.

Both of these characteristics are frequently found in files that are to
be coded. A file of information on recently published articles about
a given subject and a card catalogue for a large library are good
examples of files which require a large amount of maintenance. If
updating the index (code file) is expensive and time-consuming, updating
is put off until it is felt that the performance of the system has
deteriorated enough to justify the effort required to update it.
Until the updating takes place, information which is no longer of use
is still retrieved, and the new information, if present, is in a
secondary file. Keeping a secondary file containing recent additions
avoids the serious problem of not having new material available, but
it does decrease the efficiency of the system since such a file must
be searched separately each time an inquiry is made of the main file.

The ability to utilize an extensive vocabulary is also very
important. In the proposed system the vocabulary to be used is
derived directly from words used in the original documents, thereby
eliminating the time-consuming and expensive practice of manually
abstracting and choosing indexing terms. Machine generated derivatives
of the original vocabulary retain more information about the
original content of the item than does the manual system of assigning
descriptors. In the manual case, when selected descriptors are
assigned to a document, associations of descriptors to words and to
phrases are made. Such associations are not made in exactly the same
manner by two trained indexers, and it is likely that the associations
made by the average interrogator of an information retrieval system
will be even more diverse. Because of this lack of uniformity in
assigning descriptors it is desirable to allow each searcher to
determine the words and phrases that he wishes to associate with the
concept on which he is doing a search. Postponing such associations
until the time of the search can be accomplished only if the entire
word content is preserved in the coded form.

Ease of update and freedom of vocabulary are not enough in
themselves to make a coding algorithm worthwhile. Factors such as
speed of access, ability to make searches for combinations of words,
and compactness of the code file are also important considerations.
All of these characteristics will be considered for the coding scheme
described below.

THE  SYSTEM 

The information retrieval system which was investigated can
be divided into three components: preparation of the text, generation
of the code file, and the searching procedure. A general outline of
the first two components can be seen in Figure 1.

Since the form and format of the text to be used can be expected
to vary greatly, the text is standardized as it is read in.
Flags are set to indicate boundaries between records as well as at
the ends of lines to make it easier to reproduce the document when it
is retrieved. Also, as a measure to reduce the bulk of the file
generated (text file), extra blanks in the input text are removed. In
the pilot system the text file was generated from two sources: a
bibliography of computer science and a listing of authors and titles
from recent issues of The Computer Group News of the IEEE. Both of
these texts were read, processed, and stored on a disk. The text
file generated was 100,000 characters stored one character per byte.


Once the text file is generated, coding can proceed. The text
file is examined character by character until the end of a string which
is to be coded (a word) is encountered. The unit coded is a string of
at least three alphabetic characters surrounded by non-alphabetic
symbols (an English word). After the word is found it is compared with
a list of non-content words (i.e., the Delete List, containing words
such as of, the, and, etc.). If the word is found in the Delete List
there is no further processing of that word, and the next word is
considered.

When a word is found that is not in the Delete List, the trimming
algorithm is applied to reduce the word to a pseudo-root. Common
endings such as s, ed, ing and compound endings such as fully (as in
carefully) are removed. By removing endings, different forms of the
same word are made into synonyms. For example, the words 'computer'
and 'computers' will both be reduced to the base 'comput'. This
derived root is then passed on to the coding procedure. (The trimming
algorithm is discussed further in Appendix C.)
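
The word-finding, Delete List, and trimming steps described above can be sketched in Python as follows. This is only an illustration: the contents of `DELETE_LIST` and `ENDINGS`, the conversion to lower case, and the rule that at least three letters of stem must remain are assumptions made for the sketch, not the lists or rules of the pilot system (whose trimming algorithm is given in Appendix C).

```python
import re

# Hypothetical Delete List and ending table, for illustration only.
DELETE_LIST = {"the", "and", "for", "with", "from"}
ENDINGS = ["fully", "ers", "ing", "er", "ed", "s"]  # longest endings tried first


def trim(word):
    """Remove one common ending, keeping at least three letters of stem."""
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) - len(ending) >= 3:
            return word[: -len(ending)]
    return word


def pseudo_roots(text):
    """Return the pseudo-root of every content word in the text."""
    roots = []
    # A word is a run of at least three alphabetic characters.
    for word in re.findall(r"[A-Za-z]{3,}", text):
        word = word.lower()
        if word not in DELETE_LIST:
            roots.append(trim(word))
    return roots
```

With these assumed tables, 'computer' and 'computers' both trim to 'comput', as in the example in the text.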

In the coding procedure, a code word is generated for each
record. The code word can be thought of as a bit string containing
N bits, all of which are initialized to zero at the beginning of the
coding operation. When a trimmed word is to be coded into the code
word, the numeric values of the letters in the trimmed word are summed,
giving a number which is used to choose an element from the uniform
distribution of integers between 1 and N. Thus the resultant integer
(the code value of the word) is generated by an algorithm which, given
the same trimmed word in the future, will generate the identical code
value for that word. By using a fixed arithmetic procedure to produce
the code value for a word, the need for a dictionary of words
and assigned code values disappears. This frees the large amount
of storage which such a dictionary would occupy as well as saving
the time required to search such a file.

If for a particular word the code value generated is K, then
the K'th bit in the code word is set to one. The entire operation of
finding a word, checking the Delete List to see if it should not be
coded, trimming, and coding is repeated until the entire record is
processed. The code word, which is uniquely determined by the words
in the record, is then stored in a file (code file) along with a pointer
to the beginning of the record in the text file. This procedure is
repeated until all the records have been coded.
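
A minimal Python sketch of the coding step may make the procedure concrete. The paper states only that the letter values are summed and that the sum selects an integer uniformly between 1 and N; seeding Python's `random.Random` with the sum is an assumption standing in for the pilot system's fixed arithmetic procedure, and the names `code_value` and `code_word` are hypothetical. Roots are assumed to consist of lower-case letters a to z.

```python
import random

N_BITS = 24  # code word length used in the large-file example of the paper


def code_value(root, n_bits=N_BITS):
    """Map a trimmed word to a repeatable integer K with 1 <= K <= n_bits."""
    letter_sum = sum(ord(c) - ord("a") + 1 for c in root)  # a=1, b=2, ...
    # Assumption: the sum seeds a pseudo-random generator, so the same root
    # always yields the same K and no dictionary of code values is needed.
    return random.Random(letter_sum).randint(1, n_bits)


def code_word(roots, n_bits=N_BITS):
    """Superimpose the code values of all roots of one record."""
    word = 0  # all N bits start at zero
    for root in roots:
        k = code_value(root, n_bits)
        word |= 1 << (k - 1)  # set the K'th bit to one
    return word
```

Because several roots may receive the same K, the number of one bits in a code word can be smaller than the number of words coded into it.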

(Figure  1) 

After coding, the file is ready for searching. The searching
program accepts any number of words, each of which is processed in
the same manner as the words in the text file. It is looked for in
the Delete List, trimmed, and used to generate a code value. This code
value is then used to produce a query code in exactly the same way
as the code words were produced in the code file. Upon generation of
the query code the actual search may begin. Each code word in the
code file is matched against the query code to see if the query code
is a subset of it. (Here a bit string X is said to be a subset of
another, Y, if whenever the I'th bit in X is one, the I'th bit in Y is
also one; e.g., 1010 is a subset of 1011 while 0101 is not.) Each
time that the query code is a subset of the code word, the pointer to
the text file is used to gain access to the corresponding record, which
can be further processed to see not only if it contains the relevant
words, but that the words are in the correct order.
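
The subset test and the serial search just described can be sketched in a few lines of Python. The representation of the code file as a list of (code word, pointer) pairs and the function names are assumptions made for the illustration.

```python
def is_subset(x, y):
    """True if every one bit of X is also a one bit of Y."""
    return x & y == x


def search(code_file, query_code):
    """Return the text-file pointers of all records whose code word
    contains the query code as a subset (candidates for verification)."""
    return [pointer for word, pointer in code_file
            if is_subset(query_code, word)]
```

For example, `is_subset(0b1010, 0b1011)` holds while `is_subset(0b0101, 0b1011)` does not, matching the example in the text. The pointers returned still need the final check against the record itself.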

The above is a brief description of the coding suggested for a
file of an information scanning program. Some details, such as the
exact procedure for removing endings and the use of several
independently generated code values to produce multiple code words for a
given record, were not dealt with here. A more detailed treatment of
these problems can be found in the appendix.

RESULTS 

From the pilot system, data was gained on the performance of
such a system of superimposed coding. When possible, the performance
of the superimposed coding system will be compared with that of a
threaded list and inverted file (Figures 2 and 3). The following
factors received major consideration:

1) Ease of update

2) Effect of a large vocabulary

3) Amount and type of storage

4) Speed of search

5) Cost

FIGURE 1  Coding Procedure

Before making any comparisons it would be best to give a brief
description of threaded lists and inverted files. An inverted file consists
of two main parts, a vocabulary file and an occurrence file. As records
are processed, each significant word is looked up in the vocabulary
file. If the word has appeared before, it has associated with it a
pointer to an area in the occurrence file; if not, then an area in the
occurrence file is set aside for the word and a pointer to the first
location in that area is entered in the vocabulary file. After this
pointer is found, an entry is made in the first free location in the
corresponding area of the occurrence file to indicate the record in
which the word occurred.

(Figure  2) 

The threaded list, on the other hand, has the same type of
vocabulary file, but the occurrence file is arranged in a different
manner. The pointer in the vocabulary file now indicates a location
associated with the first record containing the given word. This
location in the occurrence file, in turn, contains a pointer to
another location in the occurrence file associated with the second
record which contains the word, and the pointer in this location
points... Thus a linked list of all the occurrences of the word is
formed.
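
The difference between the two comparison structures can be sketched as follows. This is an in-memory illustration under assumed representations (a dictionary for the vocabulary file, Python lists for the occurrence file); it is not a description of any particular implementation, and all names are hypothetical.

```python
from collections import defaultdict


def build_inverted(records):
    """Inverted file: each word maps to the list of records containing it."""
    occurrences = defaultdict(list)
    for rec_id, words in records.items():
        for w in set(words):
            occurrences[w].append(rec_id)
    return occurrences


def build_threaded(records):
    """Threaded list: the vocabulary points at the first occurrence cell;
    each cell holds [record id, index of the next cell or None]."""
    vocabulary, last, cells = {}, {}, []
    for rec_id, words in records.items():
        for w in sorted(set(words)):
            cells.append([rec_id, None])
            idx = len(cells) - 1
            if w in vocabulary:
                cells[last[w]][1] = idx  # thread from the previous occurrence
            else:
                vocabulary[w] = idx      # first occurrence of this word
            last[w] = idx
    return vocabulary, cells


def follow(vocabulary, cells, word):
    """Walk the chain of cells for one word and collect the record ids."""
    found, idx = [], vocabulary.get(word)
    while idx is not None:
        found.append(cells[idx][0])
        idx = cells[idx][1]
    return found
```

In both structures a vocabulary table must exist and be searched, which is the cost the superimposed code avoids.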

(Figure  3) 

1)  Ease  of  update 

In the proposed system a record can be added or deleted very
easily. To delete a record, a search is performed which will retrieve
the desired document. This produces not only the pointer to the
record in the text file but the location of the record's code in the
code file. The code word and pointer are removed from the code file,
and their location is recorded as being free to be used for a new
entry to the code file. The space that the text was occupying in
the text file is now also free to contain new text. In order to add
a record, which is the more common situation, the text of the new
record is added to the text file in the first free location of a
suitable size, or at the end. It is then processed in the same manner
as all the other records have been. The generated code word and
pointer are inserted in the first free space in the code file. Here
no room is wasted since all of the code word and pointer combinations
are of the same length. Thus any type of update in the code file
will affect only the code for the record which is being changed.
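
The update procedure for the code file can be illustrated with a small Python class. The class name, the use of `None` to mark a free slot, and the in-memory list are assumptions for the sketch; the point shown is only that, with fixed-length entries, a deletion frees exactly one slot and an addition reuses the first free one.

```python
class CodeFile:
    """Fixed-length (code word, pointer) entries with reuse of freed slots."""

    def __init__(self):
        self.slots = []  # each slot holds (code_word, pointer) or None if free
        self.free = []   # indices of freed slots

    def add(self, code_word, pointer):
        """Store an entry in the first free slot, or at the end."""
        entry = (code_word, pointer)
        if self.free:
            i = min(self.free)
            self.free.remove(i)
            self.slots[i] = entry
        else:
            i = len(self.slots)
            self.slots.append(entry)
        return i

    def delete(self, i):
        """Remove one entry; no other entry is touched."""
        self.slots[i] = None
        self.free.append(i)
```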

The threaded list can be updated with slightly more effort.
The problem, and a minor one, is that the records in the occurrence
file are not all of the same length, making it necessary to see if
there is enough room in a given free area to insert the new entry.

The inverted file, on the other hand, is far more difficult to
update than either of the others. If a record is to be removed, all
that need be done is to delete all pointers to it in the occurrence
file. The addition of a record, however, becomes a serious problem.
If for every word in the record there is room for an additional
pointer in the areas set aside for pointers to records containing
that word, then the update is easy. But if there is no room, a
secondary file must be set up. The number of such files will grow
until it is felt that a thorough update should be made. Then the
entire text file must be re-inverted to produce a new vocabulary and
occurrence file. This is a very time-consuming and expensive project.

2)  Effect  of  a  large  vocabulary 

With the superimposed coding there is no problem associated
with having an arbitrarily large vocabulary. This is true because
the superimposed coding does not require a table of vocabulary words
like the inverted and threaded list files do. Since the vocabulary
file is not present and does not have to be searched, increasing the
vocabulary neither lengthens the time required for a search nor
increases the amount of storage required to contain the coded
information.

FIGURE 2  Inverted File

3)  Storage  requirements 

The major advantage of superimposed coding lies in the great
economy of storage. In the pilot program which was run, a text file
of 100,000 bytes was used to produce a code file requiring 3,000
bytes. This reduction of 30 to 1 from the text to the code file is
far better than the ratio obtained with the threaded list and
inverted files. Such reductions are largest with small files such as
the one experimented with, but substantial reductions do exist even
in larger files. For example, assume that the text file consisted of
10,000,000 bibliographic entries, each containing 12 words which will
be coded. Such an author-title entry was found to have roughly 300
characters in it, implying that the text file would be roughly 3 x 10^9
characters. Also assume that an average search contains at least
three significant words. Such an assumption is made on the grounds
that a search based on fewer words would tend to return more titles
than would be of interest due to the very large size of the
bibliography. From these two assumptions, utilizing considerations
explained in Appendix D, it is found that the code file would consist
of seven code words and one pointer for each record. Each of the
code words is produced in a manner similar to the single code word
mentioned before. Now, however, once the trimmed form of the word
is found, seven different procedures are applied to produce the
pseudo-random number between 1 and N for each of the seven code words.
Each of the code words will have 24 bits and the pointer will have 32
bits, thus indicating that each record will produce 25 bytes of code
in the code file. The total size of the code file would then be
2.5 x 10^8 bytes, which still is a reduction of better than 10 to 1.

Such a reduction is far out of reach of an inverted file since
each record in the text would have to have twelve 24 bit pointers
pointing to it, and one 32 bit pointer from the record to the starting
position of that record in the text file. This requires a total of
4 x 10^8 bytes and indicates only a portion of the room taken up by
the inverted file. It does not include the vocabulary file, which
would be substantial, nor does it encompass the overhead of the
occurrence file consisting of markers for the boundary between lists of
pointers for a given word. Also it ignores the room which must be
set aside for a linking pointer in case a new occurrence is to be
added.
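
The storage figures quoted above follow from simple arithmetic, reproduced here as a short Python calculation. Only numbers stated in the text are used; the variable names are illustrative.

```python
records = 10_000_000           # bibliographic entries
chars_per_record = 300         # roughly, one character per byte
text_bytes = records * chars_per_record          # 3 x 10^9

# Superimposed coding: seven 24 bit code words plus one 32 bit pointer.
code_bits = 7 * 24 + 32                          # 200 bits = 25 bytes
code_file_bytes = records * code_bits // 8       # 2.5 x 10^8

# Inverted file (pointers only): twelve 24 bit pointers plus one
# 32 bit pointer per record.
inverted_bits = 12 * 24 + 32                     # 320 bits = 40 bytes
inverted_bytes = records * inverted_bits // 8    # 4 x 10^8

reduction = text_bytes / code_file_bytes         # 12 to 1
```

The inverted file figure is a lower bound, since it omits the vocabulary file and the occurrence file overhead mentioned in the text.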

An  additional  advantage  of  the  superimposed  coding  lies  in  the 
type  of  storage  which  can  be  used  to  store  the  code  file.  Since  the 
file  will  be  searched  serially  the  storage  media  need  not  be  random 
access.  This  permits  the  use  of  a  cheaper  sequential  access  storage 
device  such  as  magnetic  tape,  which  could  greatly  decrease  the  cost 
of  such  a  system. 

4)  Speed  of  search 

Evaluating the speed of a search using superimposed coding is
difficult since the speed of any implemented system depends heavily
on the characteristics of the storage media containing the code file
as well as on the obvious consideration of the size of the text file.
The search can be performed by reading the code file from bulk storage
into addressable memory and comparing the query codes with the code
words in software. If this is done then the time required to
search the code file can be cut to less than 6 x (the memory cycle
time of the machine) x (the total number of code words in the code
file). This speed can be achieved due to the simplicity of the
comparison which the software must make. The program only needs to
test if X is a subset of Y by loading the accumulator with Y, doing
a logical AND of the accumulator with a register which contains X,
and testing to see if the accumulator equals X. When large text
files are used, and there are several independently assigned code
words for each record, a substantial amount of time is saved by being
able to reject a record as soon as any one of the query codes fails
to be a subset of the corresponding code word. In the previously
mentioned large file, with seven code words for each record and an
average search of three words, more than 90% of the records would be
rejected after only the first comparison was made. This means that
there would be 36 memory cycle times (the time allotted for the 6
comparisons which did not have to be made) free to take care of the
overhead in the searching program.
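
The early rejection described above, and a rough check of the "more than 90%" figure, can be sketched as follows. The estimate assumes that the code values are independent and uniform and that the three query words fall on three different bits; these simplifying assumptions and the function names are not taken from the paper, whose own formula is in the appendix.

```python
def record_matches(query_codes, code_words):
    """True only if every query code is a subset of its code word.
    The loop stops at the first comparison that fails."""
    for x, y in zip(query_codes, code_words):
        if x & y != x:
            return False
    return True


def first_test_pass_rate(words_per_record=12, n_bits=24, query_words=3):
    """Estimated fraction of unrelated records passing the first comparison."""
    # Chance that a given bit of a code word is one after coding the record.
    density = 1 - (1 - 1 / n_bits) ** words_per_record
    # All query bits must be one in the code word.
    return density ** query_words
```

Under these assumptions the pass rate for the example file is roughly 0.06, so well over 90% of the records are rejected after one comparison, in agreement with the figure in the text.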

Even with this simple and fast searching procedure, a search
does require longer than with the threaded list or inverted file.
Although the implementation of this technique in software is slower,
there are several methods that radically reduce the amount of time
required to search the code file.

Since the algorithm for searching the code file is simple, the
actual testing to see if X is a subset of Y can be done with very
simple hardware. If the I'th bit of X is 1 and the I'th bit of Y is
0 for any of the values of I from 0 through 7, then X is not a subset
of Y and the value of Z will be 1. If in no case is bit I of Y = 0
and bit I of X = 1, then X is a subset of Y and Z is 0.
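
The logic of the hardware test can be mimicked bit by bit in Python. This is a software model of the gate arrangement described above, with the bits supplied as lists of 0 and 1; the function name is hypothetical.

```python
def z_output(x_bits, y_bits):
    """Return Z: 1 if some bit is one in X and zero in Y, otherwise 0."""
    z = 0
    for x, y in zip(x_bits, y_bits):
        z |= x & (1 - y)  # one AND gate with Y inverted, ORed into Z
    return z
```

Z equal to 0 therefore signals that X is a subset of Y.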

(Figure  4) 

Considering the speed of present day circuitry, the time required
to search a code file would be reduced to the time required to transfer
the data from bulk storage. Since the hardware is so simple, it is
practical to scan data from several sources simultaneously. An
alternative to having the file searched externally would be to wire
into read only memory the commands to test for a subset. By adding
instructions to use the next code word and repeat the operation if
the test fails, the search will proceed through core memory at a
rapid rate, making only one core access for each test. The end of the
list of code words can be marked by a code word containing all ones.
This has any possible query as a subset and would assure that the
loop was interrupted at that point.
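
The role of the all-ones end marker can be shown with a short Python sketch. The inner loop performs only the subset test and never checks for the end of the list; it is the marker, which matches any query, that forces the loop to stop. The function name and the list representation are assumptions for the illustration.

```python
def scan(code_words, query, n_bits=24):
    """Return the positions of all code words containing the query."""
    words = list(code_words) + [(1 << n_bits) - 1]  # all-ones end marker
    last = len(words) - 1
    hits, i = [], 0
    while True:
        # Tight loop: step to the next code word while the test fails.
        while words[i] & query != query:
            i += 1
        if i == last:      # the loop was interrupted by the end marker
            return hits
        hits.append(i)     # a genuine candidate record
        i += 1
```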

A second technique which would reduce the time required to
search the file is to sort it in some manner. One such method, which
generates a superimposed 8 bit code from the 24 bit code, is discussed
in Appendix A. Other methods, such as carefully dividing the code file
into small groups and then doing a logical OR of the chosen code words
to form rejector vectors, have been suggested.


FIGURE 4  Hardware to Test if X Is a Subset of Y

$Z = (\bar{Y}_0 \wedge X_0) \vee (\bar{Y}_1 \wedge X_1) \vee \cdots \vee (\bar{Y}_7 \wedge X_7)$

In comparing the speed of the search it should be noted that
with superimposed coding, when searching for several words, the
search for all of the words is carried out at once. In the threaded
list and inverted file a search for several words is made by making
a list of occurrences for each word and then finding the intersection
of the lists. Due to this parallelism of the search, superimposed
coding can handle a multiple word search in a more efficient manner
than the other two methods.

At first glance it appeared that searching the entire code
file would preclude the use of superimposed coding on a large file.
With more careful examination, however, it is apparent that this
type of code file can be searched as rapidly as either the threaded
list or the faster inverted file. Factors which lead to this
conclusion include:

A) The code file search can easily be implemented in hardware.
Such hardware is simple and very fast as well as being able
to handle several streams of data simultaneously.

B) If several sequential access devices or a random access
storage device is used then the code file may be structured
to allow large blocks of the code file to be rejected with
only one test.

C) The superimposed coded file is much more efficient at
handling searches for records containing several desired
keys.

5)  Cost 

The cost of implementing an information retrieval system
utilizing the type of superimposed coding suggested would be
substantially less than the cost of implementing a threaded list or
inverted file using the same text file. The reasons for this stem from
the reduced requirement for computational capability of the computer,
as well as a substantial reduction in the amount of storage required
for the coded information.

All three systems must dedicate a large amount of storage to
the actual text. This, in all of the cases, can be either directly
accessible to the computer, such as a large disk file, or may be only
machine referable, such as a machine controllable microfilm display,
like the proposed system at the University of California, Santa Cruz
or the one being used as part of Project Intrex at the Massachusetts
Institute of Technology. The difference of storage cost is not found
in the storage of the text file but in the comparison of the cost of
the storage of the code file of the superimposed coding system with
the cost of storing the vocabulary and occurrence files of the
threaded list and inverted file. The code file is smaller and can be
stored in a sequential access device rather than a random access
device. Both of these factors tend to reduce the cost of the system.

If scanning of the code file is implemented in hardware then
the requirements on the computer become very small. All that it is
responsible for is processing the words in the inquiry in order to
generate the query codes, and then, while the search is in progress,
standing by to store the pointers to the text file which the one or
possibly several hardware scanners pass to it.

The trial program, which processed the questions, generated
the query codes and handled the searching in software, was
substantially under 16,000 bytes of code on an IBM 1130 with no
overlaying. Thus the requirement for expensive core storage is low.
The cost of the hardware which would test whether the query code is
a subset of the code word, and of its interfacing with the computer,
would be very small compared to the cost of the necessary storage
devices.

One phenomenon which is found in the superimposed coding and
not in some other forms of coding is the presence of spurious matches.
These occur because, while a zero in the I'th bit of a given code word
signifies that no word assigned the code value I is in the record,
the converse is not true. Since many vocabulary words would cause the
I'th bit to be one, the I'th bit being equal to one does not indicate
that a specific word is present. By generating several independent
code words for each record, the number of times that superimposing
will cause an irrelevant record to be retrieved can be made
arbitrarily small. Take for example the case where twelve words were
coded into seven 24 bit code words. In that case
the probability that all seven of the query codes for a question
were subsets of the corresponding code words while none of the three
words involved in the search was in the given record was 3 x 10^-10.
(See formula in Appendix B, bd = .35, ew = 2.8, qc = 7.)

Since the number of such spurious matches can be limited to
any desired extent, although not entirely eliminated, it is
convenient to perform some final verifying operation to assure that the
words specified in the search are actually present. This verification
in the case of the pilot program was accomplished as a side
result of the check to see that the desired words occurred in the
specified order. Consequently there was no penalty in making this
extra check on the records which were retrieved.

The requirement that additional checking be done is not an
unreasonable one. The fact that a document contains the words in
which one is interested does not necessarily indicate that the
document is of interest. Therefore any key word searching procedure can
only be the first step of an information retrieval system. The job
of a key word search is to quickly reject records that do not contain
information of interest. In this sense any of the three types of
key word information retrieval systems which have been mentioned is
more properly an information screening procedure which can rapidly
eliminate a large portion of the text file as unlikely to contain
relevant information. Such a system should be used to identify those
records which warrant further and more extensive examination.

CONCLUSION 

The method of superimposed coding which has been discussed is
a simple and relatively inexpensive manner of scanning a large text
file. With a simple check for spurious matches made after the search,
such a system can stand alone as a key word information retrieval
system. On the other hand, since the actual scanning of the text
can be easily and rapidly handled by peripheral hardware, the method
is very attractive as a first stage screening method. Although the
prospect of having to search the entire code file for every inquiry,
at first glance, appears discouraging, the simplicity of the scanning
algorithm and the ease with which searches can be carried out in
parallel make such a linear search very reasonable.


APPENDIX A

Besides implementation in hardware, measures can be taken to
eliminate the need for searching the entire code file, thus reducing
the required search time. One manner of doing this is to use the
first code word of each record to generate a shortened code word for
it. In the case of a 24 bit code word, the first bit of the
second level code word is the logical OR of the first three bits of
the first level code word. Bits 4 through 6 are likewise ORed and
used as the second bit of the second level code word. Continuing
this process, an 8 bit second level code word is produced based on
bits 1 through 24 of the original code word. Since there are
only 256 of these second level codes possible, with each record's
first code word being mapped into one and only one of these classes,
the file is partitioned into 256 sets characterized by the numbers
0 through 255. When it is time to search the code file, the element
of the partition that the first query code belongs to is determined.

If, for example, the query code is 000100000010000001000000, it would
belong to set 84 (01010100). The only sets which would have to be
searched would be those characterized by numbers which have 84 as a
subset (i.e., 11111111, 11111110, and 11111100 would have to be searched,
but 11111011 would not have to be examined further). There would be
only 32 out of the 256 sets which would have to be searched; thus the
number of code words which would have to be compared with the query
codes would be reduced. Using the scheme of coding 12 words into 24
bits would cause roughly 10% of the code file to be classified as
255 (11111111) and just over 3% to be classified by a number whose
binary representation contains 7 ones and one zero. Due to the
non-uniform distribution of the code words over the 256 sets, the
reduction in the amount of the code file to be searched would not be
the 7/8 suggested by the reduction in the number of sets which must be
searched. The reduction would, however, be in the neighborhood of 50%.
(3/8 of the sets whose binary representation has seven ones and one
zero and 18/28 of those with six ones and two zeros can be eliminated.)
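
The partitioning described in this appendix can be sketched in a few
lines of Python. This is only an illustrative sketch, not the program
used in the pilot system. The function names second_level_code and
partitions_to_search are hypothetical, and the bits of the code word
are assumed to be numbered from the left, as in the example query code.

```python
def second_level_code(code_word, bits=24, group=3):
    """OR each consecutive group of bits (leftmost group first) into one bit.

    Assumption: bit 1 of the code word is the most significant bit.
    """
    mask = (1 << group) - 1
    result = 0
    for i in range(bits // group):
        shift = bits - group * (i + 1)
        chunk = (code_word >> shift) & mask
        result = (result << 1) | (1 if chunk else 0)
    return result


def partitions_to_search(query_class, class_bits=8):
    """Return every class number that contains query_class as a subset of bits."""
    return [c for c in range(1 << class_bits) if c & query_class == query_class]


query = int("000100000010000001000000", 2)
query_class = second_level_code(query)
classes = partitions_to_search(query_class)
print(query_class, format(query_class, "08b"), len(classes))
```

For the query code of the example, the sketch assigns class 84 and
selects 32 of the 256 classes; class 11111100 is among them and class
11111011 is not.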

Since care was taken to assign the code values using numbers
from a uniform distribution, the expected number of spurious matches
can be predicted. By varying the length and number of the code
words, the frequency of spurious matches can be controlled. The number
of spurious matches is a function of the bit density, bd (i.e., the
number of ones in the code word divided by the number of bits in the
code word); the number of code words per record, cw; the number of ones
in the query code, qc; and the number of records which are coded
into the code file, N.

The expected number of spurious matches = N x (bd)^(cw x qc)

The number of bits used to code one record = cw x (the number
of bits in the code word)

By keeping the number of bits used and the number of ones in a code
word constant in the above two equations, it is found that the minimum
number of spurious matches occurs when the number of bits in the code
word is e times the number of ones in the code word, that is, when the
bit density is 1/e. The number of bits B to use for the code word
when there are M words to be coded in each record is roughly 2.2M.
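
The optimum bit density can be checked with a short calculation. The
symbols T (total number of bits used to code one record), k (number of
ones in a code word), E (expected number of spurious matches) and
x = B/k are introduced here only for the derivation and do not appear
in the original text.

```latex
\begin{align*}
E &= N \, (bd)^{cw \cdot qc}, \qquad bd = \frac{k}{B}, \qquad cw = \frac{T}{B} \\
\ln E &= \ln N + \frac{T \, qc}{B} \, \ln\frac{k}{B}
       = \ln N - \frac{T \, qc}{k} \cdot \frac{\ln x}{x}, \qquad x = \frac{B}{k} \\
\frac{d}{dx}\!\left(\frac{\ln x}{x}\right) &= \frac{1 - \ln x}{x^{2}} = 0
  \;\Longrightarrow\; x = e, \quad B = e\,k, \quad bd = \frac{1}{e}
\end{align*}
```

Since ln x / x increases for x < e and decreases for x > e, this
stationary point is the minimum of E.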

This is found by considering that the probability that a given position
will be left blank is (1 - 1/B)^M. The expected bit density would then
be 1 - (1 - 1/B)^M. Setting this equal to 1/e and solving for B yields
the desired result.
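
The figures above can be checked numerically. The following Python
sketch is illustrative only; the function names are hypothetical, and
it assumes, as the blank-position probability implies, that each of
the M words sets one randomly chosen bit of a B bit code word.

```python
import math


def expected_spurious_matches(n_records, bit_density, code_words, query_ones):
    """N x (bd)^(cw x qc), the expected number of spurious matches."""
    return n_records * bit_density ** (code_words * query_ones)


def expected_bit_density(bits, words):
    """1 - (1 - 1/B)^M, the expected fraction of ones in a code word."""
    return 1.0 - (1.0 - 1.0 / bits) ** words


def bits_for_density(words, target=1.0 / math.e):
    """Solve 1 - (1 - 1/B)^M = target for B in closed form."""
    return 1.0 / (1.0 - (1.0 - target) ** (1.0 / words))


for m in (12, 50, 200):
    b = bits_for_density(m)
    print(m, round(b, 1), round(b / m, 2))
```

The ratio B/M printed in the last column stays close to the value of
roughly 2.2 quoted above.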

The trimming program was divided into three sections. The first
step removes all 'e's, 'd's and 's's from the end of the word. These
letters were removed since there are many words such as 'attractions'
which have compound endings terminating in s, es, d, and ed. By
removing these letters in the above example, the suffix 'tion' is left
on the end of the word where it can be easily identified and removed in
a later section of the program. Once this operation is completed the
endings 'er' then 'al' and then 'ly' are searched for and removed if
found. This procedure removes endings such as the 'ally' on the end
of 'functionally' and again is a technique to handle compound endings.
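
The first two trimming stages might be sketched in Python as follows.
This is not the original program, and the function names are
hypothetical. The text does not say whether the endings 'er', 'al' and
'ly' are checked once or repeatedly; the sketch assumes the check is
repeated until nothing more is removed, so that both parts of 'ally'
are taken off.

```python
def strip_trailing_eds(word):
    """Stage one: remove every 'e', 'd' and 's' from the end of the word."""
    return word.rstrip("eds")


def strip_er_al_ly(word):
    """Stage two: remove the endings 'er', 'al' and 'ly', tried in that order.

    Assumption: the pass is repeated until no ending matches.
    """
    changed = True
    while changed:
        changed = False
        for ending in ("er", "al", "ly"):
            if word.endswith(ending):
                word = word[: -len(ending)]
                changed = True
    return word


for w in ("attractions", "financed", "functionally"):
    print(w, "->", strip_er_al_ly(strip_trailing_eds(w)))
```

Under these assumptions 'attractions' becomes 'attraction' and
'functionally' becomes 'function', leaving 'tion' exposed for the
later stage.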

After  the  above  two  trimmings  have  been  accomplished,  the  Trim 
List  is  consulted.  Suffixes  found  in  the  Trim  List  are  arranged  in 
order  by  length,  starting  with  the  longest.  The  ending  found  in  the 
list  is  compared  letter  by  letter  with  corresponding  letters  on  the 
end of the word remaining after the first two trimming stages have
been completed. Since all of the 's's, 'e's and 'd's have been
removed, the suffixes are in an unusual form. For example, 'ness' would
have been trimmed to 'n' by the first stage of the trimming procedure.
Also  'ance'  appears  as  'anc'  in  the  Trim  List. 

The reason for having suffixes in this form can be seen by
considering the problem of trimming the two words 'finance' and
'financed'. In the second case, when the 'ed' is found on the end
of the word, it is difficult to decide if the 'ed' or just the 'd'
should be removed. The decision was made to remove the 'ed'. This
means that to trim 'financed', 'anc' must be in the Trim List.
However, 'finance', which should be reduced to the same pseudo-root,
requires either the ending 'ance' to appear in the list or the 'e'
to be removed before the ending is compared with endings in the Trim
List. The second course of action was chosen because it reduces the
length of the Trim List and makes the first step of the trimming
operation very simple.
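
A minimal Python sketch of the Trim List comparison follows. The
actual Trim List is not reproduced in this section, so the list below
is a hypothetical one containing only endings named in the text, and
the function name apply_trim_list is likewise an assumption.

```python
# Hypothetical Trim List: endings are stored in their already-trimmed
# form ('ance' as 'anc', 'ness' as 'n').
TRIM_LIST = ["tion", "anc", "ing", "n"]


def apply_trim_list(word, trim_list=TRIM_LIST):
    """Remove the first matching ending, trying the longest endings first."""
    for suffix in sorted(trim_list, key=len, reverse=True):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


for original in ("finance", "financed"):
    stage_one = original.rstrip("eds")  # trailing 'e', 'd', 's' removed first
    print(original, "->", stage_one, "->", apply_trim_list(stage_one))
```

Because the trailing letters are removed first, both 'finance' and
'financed' reach the list as 'financ', and the single entry 'anc'
reduces each to 'fin'.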

The comparison of the endings in the Trim List is continued
until either the list is exhausted or a match is found and the ending
removed. There are two more checks to be made on the trimmed word.
First, the last two letters of the word are compared. If they are
the same, then the last letter is removed. This is done so that
a word such as 'trimming' will be cut back to 'trim'. First the 'ing'
is removed to give 'trimm' and then the second 'm' is removed to give
the desired root.

The final action provides some protection against trimming
words too severely. Without it, the word 'deeds' would be trimmed to
nothing. To prevent such loss of information, any word which has been
reduced to fewer than three letters is restored to a length of three.
At this point the word is considered trimmed.
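
The two final checks can be sketched in Python as below. The function
names are hypothetical. The text does not say how a word is restored
to three letters; the sketch assumes the first three letters of the
original word are used.

```python
def drop_doubled_final_letter(word):
    """If the last two letters are the same, remove the last one."""
    if len(word) >= 2 and word[-1] == word[-2]:
        return word[:-1]
    return word


def restore_minimum_length(trimmed, original, minimum=3):
    """Restore an over-trimmed word to `minimum` letters.

    Assumption: the leading letters of the original word are used.
    """
    if len(trimmed) < minimum:
        return original[:minimum]
    return trimmed


print(drop_doubled_final_letter("trimm"))
print(restore_minimum_length("", "deeds"))
```

Under this assumption 'trimm' is cut back to 'trim', and 'deeds',
which the earlier stages reduce to nothing, is restored to 'dee'.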

(Figure  h) 

There is a major problem which occurs with the use of a
trimming algorithm. Words which do not convey the same meaning can
be reduced to the same root. An example would be that both
'information' and 'informal' are reduced to 'inform'. Such a result may
be undesirable; it is unlikely that when searching for one of the words,
the other would be of interest. Unfortunately the effect of this type
of false retrieval could not be observed in the small pilot program.
Such confusion of terms was rare due to the specialized nature of the
text. In a system utilizing a larger text file containing a more
generalized vocabulary, the number of such erroneous replies may
become substantial. If a system utilizing a trimmed form of the
vocabulary words is used for the first stage of an information
retrieval system, the problem of such extra records is not a serious
one, since the purpose of the search is to locate information-rich
sections of the text. Further examination would determine whether
the record is of interest or not.

The decision to utilize a trimming algorithm in the pilot program
was based on the feeling that the error of failing to retrieve
information was less tolerable than retrieving some irrelevant
information.